Outline of problem * Include the regular HW project cover page. Start with a one paragraph abstract, followed by an intro/background of the problem, methods, results, discussion/conclusion and acknowledgments, references, in that order. Clearly state the problem you have chosen to investigate. List the resources you used to come up with the project and reference all sources you used to complete the project. * Clearly state your hypotheses, prior to interrogating the data. * Use statistical techniques from the list of techniques we have discussed in the course to convey whether or not there is statistical evidence in support of your original hypotheses. * Explicitly state your approach to answer your research hypotheses. Write all formulas/tests/statistics you need. * Interpret your statistical (numerical) results in a lay back language. Write conclusions and discussions at the end of your report and acknowledge outside help. * * Describe how this project can be extended in the future. * One, two or three people can work on a project as a team. If people team up, everyone must contribute equally to the project and all members must submit separate copies of the project, with their names on top (the names of all team members should be on all submissions). Expectations of team projects are higher. * It’s strongly recommended that you design your study, implement your analytic protocol, conduct that modeling, package and report the findings using RMD e-notebook.
In the past decade, Lake Erie has seen high concentrations of cyanobactera, or bluegreen algae. A Severity index was created to rank the algal blooms that occur each year, with the highest severities occuring in 2011 and 2015 with 10 and 10.5 respectively. Not all of the causes of the algal blooms have been determined, however, through research many causes have been identified. These include nutrient-rich water from waste water treatment plants, farm fields and fertilized lawns, invasive species, and warm shallow water in the lake. Furthermore, scientist consider nitrogen in the form of nitrate, and phosphorus to be the main culprit in bluegreen algae growth. (Dean, 2022)
To reduce the risk of harmful algal blooms, the stats of Michigan has planned to focus on reducing phosphorus loads from waste water treatment plants, and agricultural sources in the River Raisin and Maumee River Watersheds. Furthermore, forming collaborative partnerships to provide assistance to farmers and promote conservation practices. Currently local and state focus is on reducing the growth of harmful algae, but implementation of new policy takes time. (Dean, 2022)
To assist in research several buoys were placed in Lake Erie which take multiple water quality parameters that report to research labs throughout the area. Several of these labs also include field sampling data of physicochemical properties along with bluegreen algae concentrations. Using this data, a predictive model can be trained to predict harmful algal bloom concentrations and determine if the concentration is harmful to human and enviromental health.
Data were pulled into rstudio by reading html tables using the rvest package from the ERDDAP scientific database. This database houses data for water quality parameters provided from buoys, field sampling, and laboratory tests. Data were pulled for the year of 2022, although due to time matching, the data within the time periods from August to November were used.
The water quality parameters were chosen based on availablity and significance. Some important parameters of note are chlorophyll mass and flourescense, dissolved oxygen saturation mass and fractional, and phycocyanin flourescence. Looking further into these parameters, chlorophyll is used by bluegreen algae to collect photosynthetically active light and therefore may be important in predicting algae concentration (Robert A. Andersen, n.d.). Dissolved oxygen has been known to be depleted during periods of high algal bloom growth which can affect the growth of aquatic plants and animals (Ting-ting Wu, 2015). finally, Phycocyanin is a non-toxic, water-soluble pigment protein from microalgae that exhibits antioxidant, anti-inflammatory, hepatoprotective, and neuroprotective effects (Morais, 2018).
Since the dataset is about 1000 observations, and the predictive model will require large amounts of data to train, imputation was used rather than removing the columns containing missing data. The Amelia package was used to impute the missing data. The Amelia package imputes data by using the expectation maximization algorithm with bootstrapping. Bootstrapping is a method of inferring results for a population from results found on a collection of smaller random samples of that population, using replacement during the sampling process. This algorithm works by computing the expected value of the log likelihood function with respect to the conditional distribution of Y given X using the parameter estimates of the previous iteration. This is shown as: \(Q( \theta | \theta^{(t)} ) = E_{Y | X, \theta^{(t)} }[ log \left ( L(\theta | X , Y ) \right ])\) For the maximization step, the expectation is maximized before being used again in the expectation equation. The maximization equations is shown as: \((\theta^{(t+1)}=\arg\max_{\theta}Q(\theta|\theta^{(t)}).)\)
Amelia will create copies of the dataset with new imputed values. The number of copies created will depend on the value for “m” entered. Further analysis is done on each of the “m” datasets so that a variance can be calculated. Distributions are then plotted to compare the original data distribution with the imputed distribution for each of the imputed features to validate the imputation.
A correlation and p-value matrix was generated for each feature by using the rcorr() function. The Spearman method was used due to its accuracy in both linear and non linear data. Pairs plots were created for each of the features to access linearity between features. This was done by using the plotly function with a “splom” input.
Each features importance was calculated using the Boruta package. Boruta is a feature selection function which utilizes random forest classification to determine the importance of each feature. This is done by first creating “shadow features” by copying and randomizing each original feature before appending to the original dataframe. This gives a dataframe twice the size of the original. Following this, boruta builds a Random Forest Classifier on the new feature space which determines their importance using a statistical test, known as the Z-Score.This algorithm checks if the original feature has higher importance than the maximum importance of shadow features, \((Z-Score_{original} > Z-Score_{Max\, shadow})\). If the importance is found to be higher then the feature is recorded as important, otherwise it is recorded as unimportant.
To ensure better optimization in the deep learning neural network, the data was pre-processed by normalization. By normalizing all of the data to values between 0 and 1, the deep learning network will be less likely to get trapped in local extrema caused by highly flucuating values. Instead, the algorithm will have shallower extrema and should be able to converge easier. For normalization, the following function was built, \(x_{norm}=\left( \frac{x - min(x))}{(max(x) - min(x)} \right)\).
Using the normalized data, training and testing data was created by taking random samples in a 90:10 split. This higher split was used due to using 20% of the training data as a validation step within the neural network. The output variable (bluegreen algae concentration) was left out of the training and testing data, but each (train/test) output was stored for cross validation.
The Deep learning neural network was built using the python wrapped Keras package. The network consisted of an input layer of 10 nodes (layer 1), followed by a hidden layer of 10 nodes (layer 2), a hidden layer of 120 nodes (layer 3), a dropout layer with a 30% rate (layer 4), and finally a layer with a single output node (layer 5). In each layer, the neural network utilizes an activation function which decides whether a neuron should be activated or not by calculating the weighted sum and further adding bias to it. Beginning with layer 2, the activation functions for each layer are relu, relu, and linear. The relu function stands for “rectified linear unit” and is a piecewise linear function that will output the input directly if it is positive, otherwise, it will output zero. The linear function, also known as “no activation” is where the activation is proportional to the input and simply returns the value it was given. The dropout layer is used to approximate training a large number of neural networks with different architectures in parallel. During training, a number of layer outputs are randomly ignored which has the effect of making the layer be treated like a layer with a different number of nodes and connectivity to the prior layer. This process attempts to create situations where network layers co-adapt to correct mistakes from prior layers, in turn making the model more robust.
The model was compiled using the mean-squared error for both the loss function as well as the metric. The mean-squared error is given by \(MSE=\left( \frac{1}{n} \right)\sum_{i=1}^{n}(Y_i-Y'_i)^2\), where n is the number of data points, \(Y_i\) is the observed value, and \(Y'_i\) is the predicted value. This function measures error in statistical models by using the average squared difference between observed and predicted values which tells how close a regression line is to a set of points. Furthermore, Adam was chosen as the optimizer for the model which is a replacement optimization algorithm for stochastic gradient descent. Adam combines properties of the AdaGrad and RMSProp algorithms to create an algorithm that can handle sparse gradients on noisy data.
After compliation, the model was fitted using the normalized training data input along with the normalized training data output (bluegreen algae concentration) to be used for validation. A 20% validation split was created from this dataset and validated at each of the 200 epochs.
Using the normalized testing data created earlier, the model was evaluated and the correlation to the original data was plotted to show the linearity. The predicted and observed data were then categorized by the danger level of algae (safe, caution, danger) which was decided to be (x<0.6, 0.6<x<1, x>1). These values are based on hazardous levels of the toxins produced by the algae. A confusion matrix was created to determine the accuracy of the classification.
| …1 | time | longitude | latitude | chlorophyll_fluorescence | fractional_saturation_of_oxygen_in_sea_water | mass_concentration_of_blue_green_algae_in_sea_water | mass_concentration_of_blue_green_algae_in_sea_water_rfu | mass_concentration_of_chlorophyll_in_sea_water | mass_concentration_of_oxygen_in_sea_water | sea_surface_temperature | sea_water_electrical_conductivity | sea_water_ph_reported_on_total_scale | ammonia | phosphate | phycocyanin_fluorescence | nitrate | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 1.0 | Min. :2022-08-04 16:20:00 | Min. :82.94 | Min. :41.51 | Min. :0.1400 | Min. : 9.40 | Min. :0.0000 | Min. :0.220 | Min. : 0.000 | Min. :0.000790 | Min. :280.9 | Min. :0.02449 | Min. :7.350 | Min. : 0.0 | Min. : 0.00 | Min. : 0.1600 | Min. : 0 | |
| 1st Qu.:247.8 | 1st Qu.:2022-08-15 08:25:00 | 1st Qu.:82.94 | 1st Qu.:41.51 | 1st Qu.:0.5800 | 1st Qu.: 75.26 | 1st Qu.:0.3600 | 1st Qu.:0.670 | 1st Qu.: 1.650 | 1st Qu.:0.006287 | 1st Qu.:292.0 | 1st Qu.:0.02650 | 1st Qu.:7.980 | 1st Qu.: 35.0 | 1st Qu.: 48.00 | 1st Qu.: 0.4600 | 1st Qu.: 2184 | |
| Median :494.5 | Median :2022-09-04 09:20:00 | Median :82.94 | Median :41.51 | Median :0.8000 | Median : 88.14 | Median :0.5100 | Median :0.820 | Median : 2.745 | Median :0.007465 | Median :296.9 | Median :0.02792 | Median :8.135 | Median : 46.0 | Median : 93.00 | Median : 0.7500 | Median : 6195 | |
| Mean :494.5 | Mean :2022-09-07 11:16:14 | Mean :82.94 | Mean :41.51 | Mean :0.9249 | Mean : 79.63 | Mean :0.6323 | Mean :0.949 | Mean : 3.338 | Mean :0.007200 | Mean :294.6 | Mean :0.02789 | Mean :8.129 | Mean : 900.8 | Mean : 86.47 | Mean : 0.9437 | Mean : 9018 | |
| 3rd Qu.:741.2 | 3rd Qu.:2022-09-25 03:30:00 | 3rd Qu.:82.94 | 3rd Qu.:41.51 | 3rd Qu.:1.1025 | 3rd Qu.: 93.74 | 3rd Qu.:0.7300 | 3rd Qu.:1.050 | 3rd Qu.: 4.188 | 3rd Qu.:0.008452 | 3rd Qu.:298.0 | 3rd Qu.:0.02881 | 3rd Qu.:8.310 | 3rd Qu.: 63.0 | 3rd Qu.: 115.00 | 3rd Qu.: 1.2400 | 3rd Qu.: 8422 | |
| Max. :988.0 | Max. :2022-10-31 05:30:00 | Max. :82.94 | Max. :41.51 | Max. :4.0100 | Max. :113.60 | Max. :5.1800 | Max. :5.610 | Max. :18.350 | Max. :0.011280 | Max. :300.2 | Max. :0.03693 | Max. :8.880 | Max. :89750.0 | Max. :1965.00 | Max. :13.1100 | Max. :128660 | |
| NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA’s :3 | NA | NA’s :39 | NA’s :4 |
From Table 1, there appears to be missing data in three of the columns. From the missingness map, we can see that there is only a small amount of missing data with the majority of it being in the nitrate data. To train a predictive model the missing data will need to be either removed or imputed.
| chlorophyll_flourescence_rfu | oxygen_saturation_fraction | bluegreen_algae_conc_ug.L | bluegreen_algae_conc_rfu | chlorophyll_conc_kg.m3 | oxygen_conc_kg.m3 | temp_K | elec_cond_s.m | pH | ammonia_mg.L | phosphate_mg.L | phycocayanin_flour_rfu | nitrate_mg.L | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. :0.1400 | Min. : 9.40 | Min. :0.006497 | Min. :0.220 | Min. : 0.010 | Min. :0.000790 | Min. :280.9 | Min. :0.02449 | Min. :7.350 | Min. : 0.00100 | Min. :0.00100 | Min. : 0.1600 | Min. : 0.005 | |
| 1st Qu.:0.5800 | 1st Qu.: 75.26 | 1st Qu.:0.360000 | 1st Qu.:0.670 | 1st Qu.: 1.650 | 1st Qu.:0.006287 | 1st Qu.:292.0 | 1st Qu.:0.02650 | 1st Qu.:7.980 | 1st Qu.: 0.03575 | 1st Qu.:0.04900 | 1st Qu.: 0.4600 | 1st Qu.: 2.299 | |
| Median :0.8000 | Median : 88.14 | Median :0.510000 | Median :0.820 | Median : 2.745 | Median :0.007465 | Median :296.9 | Median :0.02792 | Median :8.135 | Median : 0.04700 | Median :0.09400 | Median : 0.7800 | Median : 6.530 | |
| Mean :0.9249 | Mean : 79.63 | Mean :0.632264 | Mean :0.949 | Mean : 3.338 | Mean :0.007200 | Mean :294.6 | Mean :0.02789 | Mean :8.129 | Mean : 0.94250 | Mean :0.08704 | Mean : 0.9649 | Mean : 9.594 | |
| 3rd Qu.:1.1025 | 3rd Qu.: 93.74 | 3rd Qu.:0.730000 | 3rd Qu.:1.050 | 3rd Qu.: 4.188 | 3rd Qu.:0.008452 | 3rd Qu.:298.0 | 3rd Qu.:0.02881 | 3rd Qu.:8.310 | 3rd Qu.: 0.06500 | 3rd Qu.:0.11500 | 3rd Qu.: 1.3100 | 3rd Qu.: 8.682 | |
| Max. :4.0100 | Max. :113.60 | Max. :5.180000 | Max. :5.610 | Max. :18.350 | Max. :0.011280 | Max. :300.2 | Max. :0.03693 | Max. :8.880 | Max. :89.75000 | Max. :1.96500 | Max. :13.1100 | Max. :128.660 |
imp_list <- list(
data.frame(initial = algae_data$nitrate/1000, imputed = algae_sub$nitrate_mg.L),
data.frame(initial = algae_data$phycocyanin_fluorescence, imputed = algae_sub$phycocayanin_flour_rfu),
data.frame(initial = algae_data$phosphate/1000, imputed = algae_sub$phosphate_mg.L),
data.frame(initial = algae_data$ammonia/1000, imputed = algae_sub$ammonia_mg.L)
)
names <- c("nitrate","phycocyanin","phosphate","ammonia")
plotting <- function(df){
p<- ggplot(df)+
geom_density(aes(x=initial, fill = "Initial"),alpha=0.7)+
geom_density(aes(x=imputed, fill = "Imputed"),alpha=0.5)+
labs(x="Concentration")
ggplotly(p)
}
lapply(imp_list, plotting)## Warning: Removed 86 rows containing non-finite values (stat_density).
## Warning: Removed 39 rows containing non-finite values (stat_density).
## Warning: Removed 8 rows containing non-finite values (stat_density).
## Warning: Removed 14 rows containing non-finite values (stat_density).
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
box <- algae_sub %>% select(c("nitrate_mg.L","phosphate_mg.L","ammonia_mg.L"))
ggplot(stack(box), aes(x = ind, y = values)) +
geom_boxplot() ## Randomly samples the indexes to create training and testing data
sub <- sample(nrow(dl_norm_output), floor(nrow(dl_norm_output)*0.90))
## Index the input and ouput datasets with the randomly generated indexes to get training and testing inputs/outputs
dl_train_input <- dl_norm_input[sub, ]
dl_train_output <-dl_norm_output[sub, ]
dl_test_input <- dl_norm_input[-sub, ]
dl_test_output <-dl_norm_output[-sub, ]
## Turn all training and testing data into matrices
dl_train_mat <- as.matrix(dl_train_input)
dl_test_mat <- as.matrix(dl_test_input)
dl_train_output <- as.matrix(dl_train_output)
dl_test_output <-as.matrix(dl_test_output)
## Remove column names
colnames(dl_train_mat) <- NULL
colnames(dl_test_mat) <- NULL## loss mse
## 0.002494538 0.002494538
## warning_obs
## warning_pred Caution Danger Safe
## Caution 17 6 13
## Danger 0 8 1
## Safe 2 2 50